In this analysis, we aim to address several key questions using data visualization techniques.
Visualizing Data Distribution: We will examine the distribution of heart rate data during different activities/exercises.
Effects of Exercise on Heart Rate: We will investigate how heart rate changes after a month of regular exercise. Is there a noticeable difference?
Correlation Between Power and Heart Rate: We will explore the relationship between power output and heart rate to see if they are linked.
Heart Rate Variations Throughout Exercise: We will analyze how heart rate fluctuates during exercise over time.
Heart Rate and Distance: We will examine how heart rate is affected by the distance covered in various types of exercise.
Heart Rate, Power and Cadence Relationship: We will analyze the relationship between heart rate, power and cadence.
Geographical Information Systems Map: We will plot a GIS map to track the route taken for each activity.
For the purpose of this analysis, I have made the following assumptions:
Age Group: The heart rate analysis focuses on individuals aged 40-45.
Running Speed: The average running speed is estimated to be around 12 km/h; any activity with a higher average speed is considered bicycling.
Power Measurement: Power is defined as a measurement of the work done on the bike. This means that both the effort applied to the pedals and the cadence (the speed at which you are pedaling) contribute to the overall power output.
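The power assumption above can be made concrete with the standard mechanical relation power = torque x angular velocity. The sketch below is illustrative only: the function name `pedal_power_watts` and the example torque value are hypothetical and do not come from the Strava file, which already stores power directly.

```python
import math


def pedal_power_watts(torque_nm, cadence_rpm):
    """Mechanical power (W) from pedal torque (N*m) and cadence (rev/min)."""
    # convert revolutions per minute to radians per second
    angular_velocity = 2 * math.pi * cadence_rpm / 60
    return torque_nm * angular_velocity


# hypothetical rider: 20 N*m of torque at 90 rpm
power = pedal_power_watts(torque_nm=20.0, cadence_rpm=90)
print(round(power, 1))  # 188.5
```

Doubling either the torque or the cadence doubles the power, which is why both quantities are examined together later in the analysis.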
# import the required libraries
import warnings
import numpy as np
import pandas as pd
import statsmodels.api as sm
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from mpl_toolkits.mplot3d import Axes3D
# the analysis is based on a single file, so its path is kept as a constant
FILE_PATH = 'assets/strava.csv'
# suppress warnings, such as the pandas warning for in-place updates
warnings.filterwarnings("ignore")
def load_and_pre_process_data(file_path):
    # load the data into a pandas DataFrame
    df = pd.read_csv(file_path)
    # column names required for the analysis
    columns = ['timestamp', 'distance', 'speed', 'heart_rate', 'Power', 'position_lat' , 'position_long', 'enhanced_altitude', 'cadence']
    # keep only the required columns; copy so the assignments below do not touch df
    res = df[columns].copy()
    # parse the columns to the required datatypes
    res['timestamp'] = pd.to_datetime(res['timestamp'])
    res['date'] = res['timestamp'].dt.date
    res['date'] = pd.to_datetime(res['date'])
    # subtract the next/previous value from the current one to find the start/end of an exercise
    res['next'] = res['distance'].diff(periods=-1)
    res['prev'] = res['distance'].diff()
    # the cumulative distance drops when a new exercise starts, so every drop opens a new bucket;
    # buckets are numbered from 1 and all rows of one exercise share the same bucket number
    res['bucket'] = (res['prev'] < 0).cumsum() + 1
    return res
### group the data to find the activity type and support the other analyses
def process_and_agg_data(raw_data):
    # group the rows by bucket (one bucket per activity)
    groups = raw_data.groupby(['bucket'] , as_index=False)
    # distance travelled during each activity, converted from metres to km
    data = groups[['distance']].agg(lambda x : 0.001*(x.iloc[-1] - x.iloc[0]))
    # activity duration, converted to hours
    data['duration'] = groups[['timestamp']].agg(lambda x : (x.iloc[-1] - x.iloc[0]).total_seconds()/3600)['timestamp']
    # average speed (km/h) of the activity, used later to infer the activity type
    data['avg_speed'] = data['distance'] / data['duration']
    # start time of the activity
    data['start_time'] = groups[['timestamp']].first()['timestamp']
    # minimum heart rate during the activity
    data['min_hr'] = groups[['heart_rate']].min()['heart_rate']
    # maximum heart rate during the activity
    data['max_hr'] = groups[['heart_rate']].max()['heart_rate']
    return data
# add the activity type to the raw data
def data_updated_activity(file_path):
    raw_data = load_and_pre_process_data(file_path)
    data = process_and_agg_data(raw_data)
    # buckets whose average speed exceeds 12 km/h are bicycle activities
    bicycle_buckets = data[data['avg_speed']>12]['bucket'].to_numpy()
    # label each row, assuming that average running speed is not more than 12 km/h
    raw_data['activity_type'] = raw_data['bucket'].apply(lambda x : 'Bicycle' if x in bicycle_buckets else 'Running' )
    return raw_data
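The pre-processing above depends on one observation: the cumulative distance column falls back towards zero whenever a new activity starts. A minimal sketch on made-up numbers (not the Strava file) shows how a negative difference can be turned into an activity number with a cumulative sum; this is one possible way to express the idea, not necessarily the only one.

```python
import pandas as pd

# toy cumulative distances in metres for three back-to-back activities
toy = pd.DataFrame({'distance': [0.0, 50.0, 120.0, 0.0, 30.0, 90.0, 5.0, 40.0]})

# difference to the previous row; it is negative where the distance resets
step = toy['distance'].diff()

# each reset increments the counter, so rows of one activity share a number
toy['bucket'] = (step < 0).cumsum() + 1

print(toy['bucket'].tolist())  # [1, 1, 1, 2, 2, 2, 3, 3]
```

The first row has no previous value, so its difference is NaN; the comparison `NaN < 0` is False, which keeps that row in bucket 1.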
Figure 1 - Visualizing Data Distribution for Heart Rate
Violin plots
Distribution insight - Violin plot provides a full view of the data distribution. This includes the probability density of heart rates at different levels. It also offers insight into how heart rates are spread out across a range of values.
Multiple Modes - If the heart rate data has multiple peaks (modes), a violin plot will show these effectively.
Aesthetic Appeal - Violin plots are visually appealing and can be more engaging to an audience, making it easier to convey our findings.
Clear Indicators of Density - The width of the violin at different heart rate levels represents the density of observations at that level, making it intuitive to see how common certain heart rate levels are.
# Load data for violin plot
data = load_and_pre_process_data(FILE_PATH)
fig = plt.subplots(figsize = (9,7))
sns.violinplot(data=data , y='heart_rate').set(title = 'Distribution of Heart Rate' , ylabel = 'Heart Rate' )
plt.legend(['Min - ' + str(data['heart_rate'].min()),
'Max - ' + str(data['heart_rate'].max()),
'Mean - ' + str(int(data['heart_rate'].mean()))]
, fontsize = 'x-large')
plt.show()
The violin plot above clearly illustrates the distribution of heart rate data. The analysis reveals that the heart rate density peaks around 120, 140, and 160, with the highest density occurring at approximately 140. This observation is further supported by the mean heart rate of 134 bpm.
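Peaks read off a violin plot can be cross-checked numerically. The sketch below uses synthetic heart rates (three made-up clusters, not the Strava data) and a simple histogram with a local-maximum test; the bin width of 5 bpm is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic heart rates with three clusters; these are NOT the Strava values
hr = np.concatenate([
    rng.normal(120, 4, 2000),
    rng.normal(140, 4, 4000),
    rng.normal(160, 4, 2000),
])

# 5 bpm bins centred on 100, 105, ..., 180
counts, edges = np.histogram(hr, bins=np.arange(97.5, 185, 5))
centers = (edges[:-1] + edges[1:]) / 2

# an interior bin is a peak if it is strictly higher than both neighbours
is_peak = (counts[1:-1] > counts[:-2]) & (counts[1:-1] > counts[2:])
peaks = centers[1:-1][is_peak]

print(peaks)
print('tallest bin centre:', centers[np.argmax(counts)])
```

On the real data, `hr` would be replaced by `data['heart_rate'].dropna()`; a different bin width may merge or split peaks, so the result should be read alongside the plot rather than instead of it.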
Figure 2 - Correlation Between Power and Heart Rate
Scatter plot
Identifying Patterns - Scatter plots display individual data points on two axes (one for heart rate and the other for power output). This allows us to visually assess the relationship between the two variables. We can quickly identify trends, clusters, or patterns, helping us to see how heart rate changes with varying power outputs.
Outlier Analysis - Scatter plots help identify outliers, i.e. data points that fall outside the general pattern of results.
Variation Insights - The spread of points in the scatter plot indicates variability in heart rate responses at the same power output. This is helpful for understanding differences in fitness levels and conditions.
# load the data again to avoid any data collision if cells are not run in order
power_heart_rate = load_and_pre_process_data(FILE_PATH)[['heart_rate', 'Power']].dropna().reset_index()
#add a new scatter plot
fig = plt.subplots(figsize = (10,5))
sns.scatterplot(data = power_heart_rate , x = 'Power' , y = 'heart_rate').set(title = 'Scatterplot of heart rate vs power' , ylabel = 'Heart rate')
#regression analysis to draw a line
X = sm.add_constant(power_heart_rate['Power'])
model = sm.OLS(power_heart_rate['heart_rate'] , X).fit()
predict = model.predict(X)
plt.plot(power_heart_rate['Power'] ,predict , color = 'purple' , label = 'Regression line ')
plt.legend(loc="upper left")
plt.show()
In this analysis, we examined the correlation between power output and heart rate. The scatter plot reveals a noticeable concentration of data points, indicating a relationship between them. As power output increases, heart rate tends to rise as well, suggesting that higher levels of exertion correspond with elevated heart rates, which is consistent with what would be expected in physical activity.
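A regression line shows the direction of the relationship, while a correlation coefficient summarizes its strength in one number. The sketch below computes the Pearson coefficient and a least-squares slope on synthetic data; the slope of 0.15 bpm per watt and the noise level are invented for illustration and say nothing about the Strava measurements.

```python
import numpy as np

rng = np.random.default_rng(42)

# synthetic power (watts) and a heart rate that rises linearly with it plus noise
power = rng.uniform(100, 400, 500)
heart_rate = 100 + 0.15 * power + rng.normal(0, 8, 500)

# Pearson correlation coefficient between the two series
r = np.corrcoef(power, heart_rate)[0, 1]

# least-squares straight line: heart_rate ~ slope * power + intercept
slope, intercept = np.polyfit(power, heart_rate, 1)

print(f'Pearson r = {r:.2f}, slope = {slope:.3f} bpm per watt')
```

With the notebook's data, the same two calls could be applied to `power_heart_rate['Power']` and `power_heart_rate['heart_rate']`.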
Figure 3 - Effects of Exercise on Heart Rate
Line plot - Line plots are well suited to time series data, where they reveal patterns and trends over time.
I will plot the minimum and maximum heart rate of every exercise session to look for a pattern or trend in the data.
# get agg data
running = process_and_agg_data(load_and_pre_process_data(FILE_PATH))
### The analysis treats an average speed below 12 km/h as running.
running = running[running['avg_speed']<12]
fig = plt.subplots(figsize = (12,8))
#use custom ticks window of 5
y_ticks = [5 * i for i in range(40)]
x_ticks = running['start_time'].dt.date.apply(lambda x : str(x))
#plot min heart rate overtime for activity
sns.lineplot(x = 'start_time' , y = 'min_hr' , data = running,
linewidth = 2.5, marker = 'o' , markersize = 10).set(
title = 'Heart rate over time' , xlabel = 'Exercise date', ylabel = 'Heart rate' , yticks = y_ticks , xticks = x_ticks)
#plot max heart rate overtime for activity
sns.lineplot(x = 'start_time' , y = 'max_hr' , data = running,
linewidth = 2.5, marker = 'o' , markersize = 10).set(
title = 'Heart rate over time' , xlabel = 'Exercise date', ylabel = 'Heart rate' , yticks = y_ticks, xticks = x_ticks)
# average heart rate for a healthy person in the 40-45 age group is between 90 and 153
# plot a line for the lower heart rate limit of the age group
sns.lineplot(x = 'start_time' , y = 90 , data = running,
linewidth = 2.5 , label = 'Lower heart rate for age group 40-45')
# plot a line for the upper heart rate limit of the age group
sns.lineplot(x = 'start_time' , y = 153 , data = running,
linewidth = 2.5 , label = 'Upper heart rate for age group 40-45')
# rotate the ticks for a cleaner x axis
plt.xticks(rotation=85)
line = plt.gca().lines
plt.fill_between(line[0].get_xdata(),line[0].get_ydata(), line[1].get_ydata(), color='grey', alpha=.5)
plt.legend(loc="upper left")
plt.show()
Result
Figure 5 - Heart Rate and Distance
Line plot - Visualize distance and heart rate for the different activities to see how each activity affects heart rate.
#load data and update activity type
data = data_updated_activity(FILE_PATH)
fig = plt.subplots(figsize = (15,8))
g = sns.lineplot(x = 'distance' , y = 'heart_rate' , data = data, hue = 'activity_type', style = 'activity_type',
linewidth = 2.5, marker = 'o' , markersize = 10).set(
title = 'Line plot - distance and heart rate for different activity' , xlabel = 'distance covered in meter', ylabel = 'Heart rate')
plt.show()
Figure 6 - Heart Rate, Power and Cadence Relationship
3D plot
A 3D scatter plot adds a third dimension to the visualization. We will plot three variables, Power, heart_rate and cadence, to look for a relationship between them.
def plot_3d(data):
    # create a new 6 x 6 inch figure
    fig = plt.figure(figsize = (6,6))
    ax = fig.add_subplot(projection='3d')
    # add a 3D scatter plot; the points are coloured by cadence
    artists = ax.scatter(data["Power"], data["heart_rate"], data["cadence"],
                         s=5, c=data["cadence"], cmap='Blues')
    # label the colorbar with the variable that drives the colour
    plt.colorbar(artists).set_label("Cadence")
    # label the axes to match the plotted columns
    ax.set_xlabel('Power (watts)')
    ax.set_ylabel('Heart rate')
    ax.set_zlabel('Cadence')
    plt.show()
data = data_updated_activity(FILE_PATH)[['heart_rate' , 'Power' , 'bucket' , 'cadence']].dropna()
for bucket in data['bucket'].unique():
    bucket_data = data[data['bucket'] == bucket]
    plot_3d(bucket_data)
The plot indicates that as cadence increases, there is a corresponding increase in both heart rate and power output. This concentration suggests a predictable pattern of response during physical activities, where higher cadence results in elevated heart rates and higher power.
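A 3D scatter plot can be hard to read precisely, so a pairwise correlation matrix is a useful companion. The sketch below builds a small synthetic table (the coefficients are invented, and the column names merely mirror the notebook's) and calls `DataFrame.corr()` on it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# synthetic cadence, with power and heart rate that both rise with it plus noise
cadence = rng.uniform(60, 100, 300)
toy = pd.DataFrame({
    'cadence': cadence,
    'Power': 2.5 * cadence + rng.normal(0, 15, 300),
    'heart_rate': 60 + 1.0 * cadence + rng.normal(0, 6, 300),
})

# Pearson correlation for every pair of columns
corr = toy.corr()
print(corr.round(2))
```

On the notebook's data, the same call could be applied per activity, for example `bucket_data[['Power', 'heart_rate', 'cadence']].corr()`.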
Figure 7 - Geographical Information Systems
We will use the GIS library folium to visualize the route of every activity on a map. A GIS plot provides an interactive way to explore geographical data. We will use it to explore details of the route taken during exercise. Folium supports interactive maps, which we can use to explore the area near an activity and find an optimal route for the next one.
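The map code that follows multiplies the position columns by 180 / 2**31. That factor matches coordinates stored as 32-bit "semicircles", a convention used by Garmin FIT exports; that the file uses this encoding is an assumption inferred from the factor itself. A minimal sketch of the conversion, with a hypothetical helper name:

```python
# one semicircle unit expressed in degrees: 2**31 units span 180 degrees
SEMICIRCLE_TO_DEGREES = 180 / 2**31


def semicircles_to_degrees(value):
    """Convert a position stored in semicircles to decimal degrees."""
    return value * SEMICIRCLE_TO_DEGREES


print(semicircles_to_degrees(2**30))   # 90.0
print(semicircles_to_degrees(-2**31))  # -180.0
```

The full signed 32-bit range therefore maps onto -180 to just under 180 degrees, which covers both latitude and longitude.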
# prepare the map attributes
def prepare_gis_map(data , title):
    # convert the raw position values to degrees
    data = data * (180 / 2**31 )
    start = [data["position_lat"].iloc[0], data["position_long"].iloc[0]]
    end = [data["position_lat"].iloc[-1], data["position_long"].iloc[-1]]
    # centre the map on the start latitude/longitude
    m = folium.Map(location= start , zoom_start=14)
    # add a title to the map
    title_html = f'<h3 align="center" style="font-size:22px;color:blue; font-family:Brush Script MT,cursive"><b>{title} </b></h3>'
    m.get_root().html.add_child(folium.Element(title_html))
    # add a start marker
    folium.Marker(start, popup="Start" , icon=folium.Icon("green") ).add_to(m)
    # add an end marker
    folium.Marker(end, popup="Stop" , icon=folium.Icon("green")).add_to(m)
    # draw the route; the zip is materialised so folium receives a list of (lat, long) pairs
    folium.PolyLine(locations=list(zip(data["position_lat"], data["position_long"])),
                    weight=5, color='blue').add_to(m)
    return m
data = data_updated_activity(FILE_PATH)[['position_lat' , 'position_long' , 'bucket' , 'activity_type' , 'timestamp']].dropna()
for bucket in data['bucket'].unique():
    bucket_data = data[data['bucket'] == bucket]
    title = f'''Activity start time - {bucket_data['timestamp'].iloc[0]}
Activity Type - {bucket_data['activity_type'].iloc[0]}'''
    display(prepare_gis_map(bucket_data[['position_lat' , 'position_long' ]] , title))